Final Project: Regressing Cost and PDC Levels

Aaron Makubuya and Breton Shindel

Objectives

In this project, you will create a linear regression model for predicting total post-index cost and a logistic regression model for predicting the pdc flag for diabetic patients.

The first step in creating the models is variable selection. As you select your variables, keep in mind that smaller models are easier to interpret and less prone to overfitting.

Utilize scatter and box-plots, statistics tests such as Chi-Square, t-test, Wilcoxon rank-sum test, F-test, ANOVA, and Stepwise regression techniques to assist in variable selection. You can also compare the performance of differ- ent models on the test set by measuring the out-of-sample error.

For the linear regression model, you will need to perform residual analysis to check model assumptions.

Loading and Exploring the Data

In [114]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [115]:
%cd ~/Downloads/
/Users/bretonshindel/Downloads
In [116]:
Data = pd.read_csv("UOP_data_extract _matched_w_key.csv", skiprows=1)
In [117]:
len(Data)
Out[117]:
3550

Structuring the Data

There shouldn't be much work in terms of cleaning, but there are certainly features that we can drop that won't help us with either model. FOr instance, any post-index measures can be dropped because they are taken at the same time as the post-index cost, thus they can't help with making predictions before the cost is incurred, and they can't help with assigning PDC flag because the PDC level is calculated before any post-index measures occur.

We can also drop NaNs because some lodels such as logistic regression won't perform if there are missing values from the features.

In [118]:
AC = Data[Data.drug_class == '*ANTICOAGULANTS*']
AD = Data[Data.drug_class == '*ANTIDIABETICS*']
In [119]:
Data = Data[Data.drug_class == '*ANTIDIABETICS*']
In [120]:
Data = Data.drop('num_ip_post', axis=1)
Data = Data.drop('total_los_post', axis=1)
Data = Data.drop('num_op_post', axis=1)
Data = Data.drop('num_er_post', axis=1)
Data = Data.drop('num_ndc_post', axis=1)
Data = Data.drop('num_gpi6_post', axis=1)
Data = Data.drop('adjust_total_30d_post', axis=1)
Data = Data.drop('generic_rate_post', axis=1)
Data = Data.drop('post_ip_flag', axis=1)
Data = Data.drop('post_er_flag', axis=1)
Data = Data.drop('pdc_cat', axis=1)
Data = Data.drop('post_ip_cost', axis=1)
Data = Data.drop('post_er_cost', axis=1)
Data = Data.drop('post_rx_cost', axis=1)
Data = Data.drop('post_op_cost', axis=1)
Data = Data.drop('post_medical_cost', axis=1)
Data = Data.drop('drug_class', axis=1)
Data = Data.drop('patient_key', axis=1)
Data = Data.drop('pdc', axis=1)
Data = Data.drop('post_total_cost', axis=1)
Data = Data.drop('generic_cost_post', axis=1)
Data = Data.drop('brand_cost_post', axis=1)
Data = Data.drop('ratio_G_total_cost_post', axis=1)
In [121]:
Data[['Age_18_44', 'Age_45_64', 'Age_65_Up']] = pd.get_dummies(Data.age_grpN)
In [122]:
Data[['Male', 'Female']] = pd.get_dummies(Data.sexN)
In [123]:
Data[['Reg_NE', 'Reg_MW', 'Reg_S', 'Reg_W']] = pd.get_dummies(Data.regionN)
In [124]:
#Drop repeating columns now that we have their dummy values
Data = Data.drop('age_grpN', axis=1)
Data = Data.drop('sexN', axis=1)
Data = Data.drop('regionN', axis=1)
In [125]:
Data = Data.dropna()
In [126]:
from sklearn.preprocessing import StandardScaler

#Standardize the Matrix values to prepare for feature processing
sc = StandardScaler()
tst = sc.fit_transform(Data['Male'])
tst
Out[126]:
array([-0.99637022,  1.003643  ,  1.003643  , ..., -0.99637022,
       -0.99637022, -0.99637022])

This really messes things up, so we will not be standardizing any discrete values.

In [127]:
Data['idx_copay'] = sc.fit_transform(Data['idx_copay'])
Data['log_idx_copay'] = sc.fit_transform(Data['log_idx_copay'])
Data['pre_ip_cost'] = sc.fit_transform(Data['pre_ip_cost'])
Data['pre_er_cost'] = sc.fit_transform(Data['pre_er_cost'])
Data['pre_rx_cost'] = sc.fit_transform(Data['pre_rx_cost'])
Data['pre_op_cost'] = sc.fit_transform(Data['pre_op_cost'])
Data['pre_total_cost'] = sc.fit_transform(Data['pre_total_cost'])
Data['pre_medical_cost'] = sc.fit_transform(Data['pre_medical_cost'])
Data['adjust_total_30d'] = sc.fit_transform(Data['adjust_total_30d'])
Data['generic_rate'] = sc.fit_transform(Data['generic_rate'])
Data['log_pre_ip_cost'] = sc.fit_transform(Data['log_pre_ip_cost'])
Data['log_pre_er_cost'] = sc.fit_transform(Data['log_pre_er_cost'])
Data['log_pre_op_cost'] = sc.fit_transform(Data['log_pre_op_cost'])
Data['log_pre_rx_cost'] = sc.fit_transform(Data['log_pre_rx_cost'])
Data['generic_cost'] = sc.fit_transform(Data['generic_cost'])
Data['brand_cost'] = sc.fit_transform(Data['brand_cost'])
Data['ratio_G_total_cost'] = sc.fit_transform(Data['ratio_G_total_cost'])
In [128]:
pd.value_counts(Data['Reg_W'])
Out[128]:
0    1535
1     115
Name: Reg_W, dtype: int64
In [129]:
pairs = pd.DataFrame([Data.idx_copay, Data.log_idx_copay, Data.pre_ip_cost, Data.pre_er_cost, Data.pdc_80_flag])
In [130]:
%matplotlib inline

sns.set_context('poster')
sns.mpl.rc("figure", figsize=(12,6))
In [131]:
sns.pairplot(Data, size=3, vars=["idx_copay", "log_idx_copay", "pre_ip_cost", "pre_er_cost", "pre_rx_cost",
                                 "pre_op_cost", "pre_medical_cost", "pre_total_cost", "adjust_total_30d",
                                 "generic_rate", "log_pre_ip_cost", "log_pre_er_cost", 'log_pre_op_cost',
                                 'log_pre_rx_cost', "generic_cost", "brand_cost", "ratio_G_total_cost"],
             hue='pdc_80_flag');
/Users/bretonshindel/anaconda/lib/python3.4/site-packages/matplotlib/__init__.py:892: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
In [132]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 20)
In [133]:
print('Total features:', Data.shape[1])
print('Total observations:', Data.shape[0])

Data.head(3)
Total features: 78
Total observations: 1650
Out[133]:
idx_copay idx_prodtypeN idx_paytypN age_cat log_idx_copay ALCOHOL_DRUG ASTHMA CARDIAC_ARRYTHMIA CARDIAC_VALVULAR CEREBROVASCULAR CHRONIC_KIDNEY CHRONIC_PAIN_FIBRO CHF COPD DEMENTIA DEPRESSION DIABETES DYSLIPIDEMIA EPILEPSY_SEIZURE HEPATITIS HIV_AIDS HYPERTENSION LIVER_GALLBLADDER_PANCREAS MI_CAD OSTEOARTHRITIS PARALYSIS PEPTIC_ULCER PERIPHERAL_VASCULAR RENAL_FAILURE RHEUMATOLOGIC SCHIZOPHRENIA SLEEP_DISORDERS SMOKING THYROID Solid_Tumor Metastatic Leukemia_Lymphoma Other_Cancer Cancer_In_Situ pre_CCI pre_ip_cost pre_er_cost pre_rx_cost pre_op_cost pre_total_cost pre_medical_cost num_ip total_los num_op num_er num_ndc num_gpi6 adjust_total_30d generic_rate pre_ip_flag pre_er_flag log_pre_ip_cost log_pre_er_cost log_pre_op_cost log_pre_rx_cost pre_total_cat numofgen numofbrand generic_cost brand_cost ratio_G_total_cost numofgen_post numofbrand_post pdc_80_flag Age_18_44 Age_45_64 Age_65_Up Male Female Reg_NE Reg_MW Reg_S Reg_W
1 1.112443 2 1 2 0.943165 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -0.203079 -0.202007 0.585524 -0.120626 -0.054834 -0.238604 0 0 27 0 15 6 -0.146839 -2.756231 0 0 -0.273048 -0.446072 0.395616 1.148968 7 2 13 -0.527138 0.726295 -1.356582 2 13 0 1 0 0 0 1 0 0 1 0
4 -0.556196 2 4 2 -0.170128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 -0.203079 -0.202007 -0.212157 -0.390342 -0.388516 -0.360230 0 0 3 0 17 4 0.103884 0.861590 0 0 -0.273048 -0.446072 -0.074529 0.390026 4 17 0 1.058492 -0.373082 1.083909 14 2 1 1 0 0 1 0 0 0 1 0
5 -0.370129 3 1 2 0.165475 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -0.203079 -0.202007 -0.234585 -0.263932 -0.342655 -0.303226 0 0 6 0 10 6 -0.594129 0.666417 0 0 -0.273048 -0.446072 0.214910 0.340297 4 9 1 0.314504 -0.291908 0.203710 6 0 0 1 0 0 1 0 0 0 1 0
In [134]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()

Defining our Tests

We will be working with Chi-Squate Tests, T-Tests, Wilcoxon Rank-Sum Tests, F-Tests, ANOVA Tests, and Stepwise Regression Techniques to determine which features are most important. Each test is uniquely qualified for different types of data, and provides us different findings on the relationships between features, and the meaning of each p-value. We will also be working with Logistic Regressions, Linear Regressions, and Residual Analysis. We will define each test and model below before proceeding further.

Linear Regression

An approach for modeling the relationship between a scalar dependent variable y, and one or more independent variables X. In this case we will be working with multiple linear regression because we have many correlated features that we want to include in our model of feature y. The relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Linear regression focuses on the conditional probability distribution of y given X, rather than a join probability distribution of y and X. One of the main goals of linear regression is prediction of the dependent variable, but it can also be used to assess the relationship between y and Xj, to assess which Xj may have no relationship with y at all, and to identify which subsets of the Xj contain redundant information about y.

Conditions:

  • The dependent feature y has a linear relationship to the independent feature Xj. This holds true if a scatter plot is linear and a residuals plot is mostly random
  • The y values are normally distributed
  • Variables are only quantitative. Note that many features in our data are binary values. These are categorical features disguised as quantitative, and thus we can't perform regression on these variables because they do not meet the quantitative variables condition
  • No missing values in your observations

Logistic Regression

The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). As such it is not a classification method. Logistic regression can be seen as a special case of generalized linear model and thus analogous to linear regression, but is based on different assumptions. First, the conditional distribution y \mid x is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes.

Conditions:

  • Binary logistic regression requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal. Reducing an ordinal or metric variable to dichotomous level loses a lot of information, which makes this test inferior compared to ordinal logistic regression
  • Because logistic regression assumes that P(Y=1) is the probability of the event occurring, it is necessary that the dependent variable is coded accordingly. That is, for a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
  • The model should be fitted correctly. Neither over fitting nor under fitting should occur. That is only the meaningful variables should be included, but also all meaningful variables should be included. A good approach to ensure this is to use a stepwise method to estimate the logistic regression.
  • The error terms need to be independent. Logistic regression requires each observation to be independent. That is that the data-points should not be from any dependent samples design, e.g., before-after measurements, or matched pairings
  • It requires quite large sample sizes. Because maximum likelihood estimates are less powerful than ordinary least squares (e.g., simple linear regression, multiple linear regression)

Residual Analysis

Using residual plots, you can assess whether the observed error (residuals) is consistent with stochastic error. In regression models, You shouldn’t be able to predict the error for any given observation. And, for a series of observations, you can determine whether the residuals are consistent with random error.

Chi_Square Test

A chi square (X2) statistic is used to investigate whether distributions of categorical variables differ from one another. The Chi Square statistic compares the tallies or counts of categorical responses between two (or more) independent groups. (note: Chi square tests can only be used on actual numbers and not on percentages, proportions, means, etc.)

Conditions:

  • Discrete Data

T_Test

Student's t test for independent samples is used to determine whether two samples were drawn from populations with different means. Using a one-tailed P-value assumes you already know before you even see the values which group should be larger and which should be smaller. Since this is usually not true, we almost always use a two-tailed test. If you have more than two groups, you shouldn't use multiple t-tests because the error adds up (see familywise error) and thus you increase your chances of finding an effect when there really isn't one (type 1 error). Therefore when you have more than two groups to compare e.g. in a drugs trial when you have a high dose, low does and a placebo group (3 groups), you use ANOVA to examine whether there any differences between the groups.

Conditions:

  • Normal, continuous data

Interpreting the p-value: The null hypothesis states that the two samples come from the same group. If the p-value falls below the significance level, we reject the null hypothesis, and assume that the samples come from two different populations.

Conditions:

Wilcoxon Rank-Sum

When the requirements for the t-test for two independent samples are not satisfied, the Wilcoxon Rank-Sum non-parametric test can often be used provided the two independent samples are drawn from populations with an ordinal distribution. It studies whether or not different samples come from the same population. If one observation is made at random from each population (call them x0 and y0), then the probability that x0 > y0 is the same as the probability that x0 < y0, and so the populations for each sample have the same medians. Since it compares rank sums, the Wilcoxon Rank-Sum test is more robust than the t-test as it is less likely to indicate spurious results based on the presence of outliers. Even for large samples where the assumptions for the t-test are met, the Wilcoxon Rank-Sum test is only a little less efficient than the t-test.

Interpreting the p-value:The null hypothesis states that the observations come from the same population. If our p-value falls below a specified alpha level, we reject the null hypothesis, and assume that the samples come from two different populations

Conditions:

F-Test

The F-distribution is formed by the ratio of two independent chi-square variables divided by their respective degrees of freedom. Since F is formed by chi-square, many of the chi-square properties carry over to the F distribution. The F-test is designed to test if two population variances are equal, and does this by comparing the ratio of two variances. If the variances are equal, the ratio of the variances will be 1. The F-distribution is commonly used in ANOVA, and it is the ratio of two chi-square distributions, and hence is right skewed. It has a minimum of 0, but no maximum value (all values are positive). The peak of the distribution is not far from 0. The F-test is the ratio of systematic variance : unsystematic variance, so higher scores are better.

Interpreting the p-value: The null hypothesis states that the two sample variances are equal. If our p-value falls below the assigned alpha level, we reject the null hypothesis and assume that our sample variances are not equal.

ANOVA Test

Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences among group means and their associated procedures (such as "variation" among and between groups). ANOVA can be used in cases where there are more than two groups. The t-test can only be used to test differences between two means. When there are more than two means, but conducting such multiple t-tests can lead to complications. In such circumstances we use ANOVA. This technique is used whenever an alternative procedure is needed for testing hypotheses concerning means when there are several populations.

Stepwise Regression

Stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of F-tests or t-tests. Properly used, stepwise regression puts more power and information at your fingertips than does the ordinary multiple regression option, and it is especially useful for sifting through large numbers of potential independent variables and/or fine-tuning a model by poking variables in or out. Improperly used, it may converge on a poor model while giving you a false sense of security.

Analyzing PDC Flag Value with Logistic Regression

In [135]:
Data.insert(0, 'pdc_flag', Data.pdc_80_flag)
In [136]:
Data = Data.drop('pdc_80_flag', axis=1)
In [137]:
Data.head(3)
Out[137]:
pdc_flag idx_copay idx_prodtypeN idx_paytypN age_cat log_idx_copay ALCOHOL_DRUG ASTHMA CARDIAC_ARRYTHMIA CARDIAC_VALVULAR CEREBROVASCULAR CHRONIC_KIDNEY CHRONIC_PAIN_FIBRO CHF COPD DEMENTIA DEPRESSION DIABETES DYSLIPIDEMIA EPILEPSY_SEIZURE HEPATITIS HIV_AIDS HYPERTENSION LIVER_GALLBLADDER_PANCREAS MI_CAD OSTEOARTHRITIS PARALYSIS PEPTIC_ULCER PERIPHERAL_VASCULAR RENAL_FAILURE RHEUMATOLOGIC SCHIZOPHRENIA SLEEP_DISORDERS SMOKING THYROID Solid_Tumor Metastatic Leukemia_Lymphoma Other_Cancer Cancer_In_Situ pre_CCI pre_ip_cost pre_er_cost pre_rx_cost pre_op_cost pre_total_cost pre_medical_cost num_ip total_los num_op num_er num_ndc num_gpi6 adjust_total_30d generic_rate pre_ip_flag pre_er_flag log_pre_ip_cost log_pre_er_cost log_pre_op_cost log_pre_rx_cost pre_total_cat numofgen numofbrand generic_cost brand_cost ratio_G_total_cost numofgen_post numofbrand_post Age_18_44 Age_45_64 Age_65_Up Male Female Reg_NE Reg_MW Reg_S Reg_W
1 0 1.112443 2 1 2 0.943165 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -0.203079 -0.202007 0.585524 -0.120626 -0.054834 -0.238604 0 0 27 0 15 6 -0.146839 -2.756231 0 0 -0.273048 -0.446072 0.395616 1.148968 7 2 13 -0.527138 0.726295 -1.356582 2 13 1 0 0 0 1 0 0 1 0
4 1 -0.556196 2 4 2 -0.170128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 -0.203079 -0.202007 -0.212157 -0.390342 -0.388516 -0.360230 0 0 3 0 17 4 0.103884 0.861590 0 0 -0.273048 -0.446072 -0.074529 0.390026 4 17 0 1.058492 -0.373082 1.083909 14 2 1 0 0 1 0 0 0 1 0
5 0 -0.370129 3 1 2 0.165475 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 -0.203079 -0.202007 -0.234585 -0.263932 -0.342655 -0.303226 0 0 6 0 10 6 -0.594129 0.666417 0 0 -0.273048 -0.446072 0.214910 0.340297 4 9 1 0.314504 -0.291908 0.203710 6 0 1 0 0 1 0 0 0 1 0
In [138]:
PDC_0 = Data[Data.pdc_flag == 0]
PDC_1 = Data[Data.pdc_flag == 1]

Testing the Features

In [139]:
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
In [140]:
from sklearn.linear_model import Lasso, LassoCV, LassoLarsCV, LassoLarsIC, MultiTaskLasso

#clf = linear_model.Lasso(alpha=0.1)
X_Frame = Data.iloc[:, 1:].values
y_Frame = Data.iloc[:, 0].values
Lasso(Data.iloc[:, 1:], Data.iloc[:, 0])
Out[140]:
Lasso(alpha=      idx_copay  idx_prodtypeN  idx_paytypN  age_cat  log_idx_copay  \
1      1.112443              2            1        2       0.943165
4     -0.556196              2            4        2      -0.170128
5     -0.370129              3            1        2       0.165475
8     -0.5...      1      0
3532       0      1      0
3533       0      1      0

[1650 rows x 77 columns],
   copy_X=True,
   fit_intercept=1       0
4       1
5       0
8       0
16      1
18      0
20      0
21      1
22      0
23      0
       ..
3520    0
3522    1
3523    0
3525    0
3527    0
3528    0
3529    1
3530    0
3532    0
3533    1
Name: pdc_flag, dtype: int64,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
In [141]:
pd.set_option('display.max_rows', 20)

Copay

In [183]:
#Plotting Copay to check for normality
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

import statsmodels.api as sm
sm.qqplot(Data.idx_copay, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Copay',  size=12, weight='bold')

sns.distplot(Data.idx_copay, bins=20)
sns.rugplot(Data.idx_copay, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Copay',  size=12, weight='bold')
plt.show();
In [184]:
from scipy.stats import ttest_ind
from scipy.stats import ranksums

# T-Test:
tstat, idx_copay_p_value = ttest_ind(PDC_0.idx_copay, PDC_1.idx_copay, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, idx_copay_p_valueZ = ranksums(PDC_0.idx_copay, PDC_1.idx_copay)

print("p-Value, t-Test =", idx_copay_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", idx_copay_p_valueZ)
p-Value, t-Test = 0.843155120732
p-Value, Wilcoxon Rank-Sum Test = 0.590366886227

Health Plan Type

In [185]:
Data.pivot_table(['pdc_flag'], index=[Data.idx_prodtypeN, Data.pdc_flag], aggfunc=len).unstack()
Out[185]:
pdc_flag
pdc_flag 0 1
idx_prodtypeN
1 13 13
2 945 614
3 38 27
In [211]:
from scipy.stats import chisquare
from scipy.stats import chi2_contingency

chi2, idx_prodtypeN_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.idx_prodtypeN, Data.pdc_flag],
                               aggfunc=len).unstack())
idx_prodtypeN_p_value = round(idx_prodtypeN_p_value, 5)
print("p-Value = ", idx_prodtypeN_p_value)
p-Value =  0.52026

Primary Payer Type

In [201]:
Data.pivot_table(['pdc_flag'], index=[Data.idx_paytypN, Data.pdc_flag], aggfunc=len).unstack()
Out[201]:
pdc_flag
pdc_flag 0 1
idx_paytypN
1 534 321
2 6 10
3 2 2
4 454 321
In [187]:
chi2, idx_paytypN_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.idx_paytypN, Data.pdc_flag],
                               aggfunc=len).unstack())
idx_paytypN_p_value = round(idx_paytypN_p_value, 5)
print("p-Value = ", idx_paytypN_p_value)
p-Value =  0.0992

Age Group New

In [188]:
chi2, age_cat_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.age_cat, Data.pdc_flag],
                               aggfunc=len).unstack())
age_cat_p_value = round(age_cat_p_value, 5)
print("p-Value = ", age_cat_p_value)
p-Value =  0.0

Log of Copay

In [189]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.log_idx_copay, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Copay',  size=12, weight='bold')

sns.distplot(Data.log_idx_copay, bins=20)
sns.rugplot(Data.log_idx_copay, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Log Copay',  size=12, weight='bold')
plt.show();
In [190]:
# T-Test:
tstat, log_idx_copay_p_value = ttest_ind(PDC_0.log_idx_copay, PDC_1.log_idx_copay, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, log_idx_copay_p_valueZ = ranksums(PDC_0.log_idx_copay, PDC_1.log_idx_copay)

print("p-Value, t-Test =", log_idx_copay_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_idx_copay_p_valueZ)
p-Value, t-Test = 0.0399771553742
p-Value, Wilcoxon Rank-Sum Test = 0.590330429362

Alcohol or Drug Use

In [191]:
#PDC Relationship
chi2, ALCOHOL_DRUG_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.ALCOHOL_DRUG, Data.pdc_flag],
                               aggfunc=len).unstack())
ALCOHOL_DRUG_p_value = round(ALCOHOL_DRUG_p_value, 5)
print("p-Value = ", ALCOHOL_DRUG_p_value)
p-Value =  0.85945

Asthma

In [84]:
chi2, ASTHMA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.ASTHMA, Data.pdc_flag],
                               aggfunc=len).unstack())
ASTHMA_p_value = round(ASTHMA_p_value, 5)
print("p-Value = ", ASTHMA_p_value)
p-Value =  0.97427

Cardiac Arrest

In [85]:
chi2, CARDIAC_ARRYTHMIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.CARDIAC_ARRYTHMIA, Data.pdc_flag],
                               aggfunc=len).unstack())
CARDIAC_ARRYTHMIA_p_value = round(CARDIAC_ARRYTHMIA_p_value, 5)
print("p-Value = ", CARDIAC_ARRYTHMIA_p_value)
p-Value =  0.20695

Cardiac Valcular

In [86]:
chi2, CARDIAC_VALVULAR_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.CARDIAC_VALVULAR, Data.pdc_flag],
                               aggfunc=len).unstack())
CARDIAC_VALVULAR_p_value = round(CARDIAC_VALVULAR_p_value, 5)
print("p-Value = ", CARDIAC_VALVULAR_p_value)
p-Value =  0.23555

Cerebrovasular

In [87]:
chi2, CEREBROVASCULAR_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.CEREBROVASCULAR, Data.pdc_flag],
                               aggfunc=len).unstack())
CEREBROVASCULAR_p_value = round(CEREBROVASCULAR_p_value, 5)
print("p-Value = ", CEREBROVASCULAR_p_value)
p-Value =  0.96935

Chronic Kidney

In [88]:
chi2, CHRONIC_KIDNEY_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.CHRONIC_KIDNEY, Data.pdc_flag],
                               aggfunc=len).unstack())
CHRONIC_KIDNEY_p_value = round(CHRONIC_KIDNEY_p_value, 5)
print("p-Value = ", CHRONIC_KIDNEY_p_value)
p-Value =  0.49982

Chronic Pain/Fibromyalgia

In [89]:
chi2, CHRONIC_PAIN_FIBRO_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.CHRONIC_PAIN_FIBRO, Data.pdc_flag],
                               aggfunc=len).unstack())
CHRONIC_PAIN_FIBRO_p_value = round(CHRONIC_PAIN_FIBRO_p_value, 5)
print("p-Value = ", CHRONIC_PAIN_FIBRO_p_value)
p-Value =  0.47384

Congestive Heart Failure

In [90]:
chi2, CHF_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.CHF, Data.pdc_flag],
                               aggfunc=len).unstack())
CHF_p_value = round(CHF_p_value, 5)
print("p-Value = ", CHF_p_value)
p-Value =  0.76504

COPD

In [91]:
chi2, COPD_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.COPD, Data.pdc_flag],
                               aggfunc=len).unstack())
COPD_p_value = round(COPD_p_value, 5)
print("p-Value = ", COPD_p_value)
p-Value =  0.49016

Dementia/Alzeimers

In [92]:
chi2, DEMENTIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.DEMENTIA, Data.pdc_flag],
                               aggfunc=len).unstack())
DEMENTIA_p_value = round(DEMENTIA_p_value, 5)
print("p-Value = ", DEMENTIA_p_value)
p-Value =  nan

Depression

In [93]:
chi2, DEPRESSION_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.DEPRESSION, Data.pdc_flag],
                               aggfunc=len).unstack())
DEPRESSION_p_value = round(DEPRESSION_p_value, 5)
print("p-Value = ", DEPRESSION_p_value)
p-Value =  0.2444

Diabetes

In [94]:
chi2, DIABETES_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.DIABETES, Data.pdc_flag],
                               aggfunc=len).unstack())
DIABETES_p_value = round(DIABETES_p_value, 5)
print("p-Value = ", DIABETES_p_value)
#Well that makes sense, huh?
p-Value =  1e-05

Dyslipidemia

In [96]:
chi2, DYSLIPIDEMIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.DYSLIPIDEMIA, Data.pdc_flag],
                               aggfunc=len).unstack())
DYSLIPIDEMIA_p_value = round(DYSLIPIDEMIA_p_value, 5)
print("p-Value = ", DYSLIPIDEMIA_p_value)
p-Value =  1e-05

Epilepsy

In [97]:
chi2, EPILEPSY_SEIZURE_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.EPILEPSY_SEIZURE, Data.pdc_flag],
                               aggfunc=len).unstack())
DYSLIPIDEMIA_p_value = round(EPILEPSY_SEIZURE_p_value, 5)
print("p-Value = ", EPILEPSY_SEIZURE_p_value)
p-Value =  0.647958729995

Hepatitis

In [98]:
chi2, HEPATITIS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.HEPATITIS, Data.pdc_flag],
                               aggfunc=len).unstack())
HEPATITIS_p_value = round(HEPATITIS_p_value, 5)
print("p-Value = ", HEPATITIS_p_value)
p-Value =  0.41139

HIV/AIDS

In [99]:
chi2, HIV_AIDS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.HIV_AIDS, Data.pdc_flag],
                               aggfunc=len).unstack())
HIV_AIDS_p_value = round(HIV_AIDS_p_value, 5)
print("p-Value = ", HIV_AIDS_p_value)
p-Value =  0.72309

Hypertension

In [100]:
chi2, HYPERTENSION_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.HYPERTENSION, Data.pdc_flag],
                               aggfunc=len).unstack())
HYPERTENSION_p_value = round(HYPERTENSION_p_value, 5)
print("p-Value = ", HYPERTENSION_p_value)
p-Value =  0.0

Liver Gallbladder Pancreas

In [101]:
chi2, LIVER_GALLBLADDER_PANCREAS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.LIVER_GALLBLADDER_PANCREAS, Data.pdc_flag],
                               aggfunc=len).unstack())
LIVER_GALLBLADDER_PANCREAS_p_value = round(LIVER_GALLBLADDER_PANCREAS_p_value, 5)
print("p-Value = ", LIVER_GALLBLADDER_PANCREAS_p_value)
p-Value =  0.90219

Myocardial Infarction/CAD

In [102]:
chi2, MI_CAD_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.MI_CAD, Data.pdc_flag],
                               aggfunc=len).unstack())
MI_CAD_p_value = round(MI_CAD_p_value, 5)
print("p-Value = ", MI_CAD_p_value)
p-Value =  0.56749

Osteoarthritis

In [103]:
chi2, OSTEOARTHRITIS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.OSTEOARTHRITIS, Data.pdc_flag],
                               aggfunc=len).unstack())
OSTEOARTHRITIS_p_value = round(OSTEOARTHRITIS_p_value, 5)
print("p-Value = ", OSTEOARTHRITIS_p_value)
p-Value =  0.79526

Paralysis/hemiplegia/paraplegia

In [104]:
chi2, PARALYSIS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.PARALYSIS, Data.pdc_flag],
                               aggfunc=len).unstack())
PARALYSIS_p_value = round(PARALYSIS_p_value, 5)
print("p-Value = ", PARALYSIS_p_value)
p-Value =  0.89635

Peptic ulcer disease

In [105]:
chi2, PEPTIC_ULCER_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.PEPTIC_ULCER, Data.pdc_flag],
                               aggfunc=len).unstack())
PEPTIC_ULCER_p_value = round(PEPTIC_ULCER_p_value, 5)
print("p-Value = ", PEPTIC_ULCER_p_value)
p-Value =  0.64715

Peripheral vascular disease

In [106]:
chi2, PERIPHERAL_VASCULAR_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.PERIPHERAL_VASCULAR, Data.pdc_flag],
                               aggfunc=len).unstack())
PERIPHERAL_VASCULAR_p_value = round(PERIPHERAL_VASCULAR_p_value, 5)
print("p-Value = ", PERIPHERAL_VASCULAR_p_value)
p-Value =  0.15791

Renal failure/dialysis

In [108]:
chi2, RENAL_FAILURE_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.RENAL_FAILURE, Data.pdc_flag],
                               aggfunc=len).unstack())
RENAL_FAILURE_p_value = round(RENAL_FAILURE_p_value, 5)
print("p-Value = ", RENAL_FAILURE_p_value)
p-Value =  0.58787

Rheumatologic disease (SLE, RA, AS, PSA)

In [109]:
chi2, RHEUMATOLOGIC_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.RHEUMATOLOGIC, Data.pdc_flag],
                               aggfunc=len).unstack())
RHEUMATOLOGIC_p_value = round(RHEUMATOLOGIC_p_value, 5)
print("p-Value = ", RHEUMATOLOGIC_p_value)
p-Value =  0.52855

Schizophrenia

In [110]:
chi2, SCHIZOPHRENIA_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.SCHIZOPHRENIA, Data.pdc_flag],
                               aggfunc=len).unstack())
SCHIZOPHRENIA_p_value = round(SCHIZOPHRENIA_p_value, 5)
print("p-Value = ", SCHIZOPHRENIA_p_value)
p-Value =  nan

Sleep disorders

In [111]:
chi2, SLEEP_DISORDERS_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.SLEEP_DISORDERS, Data.pdc_flag],
                               aggfunc=len).unstack())
SLEEP_DISORDERS_p_value = round(SLEEP_DISORDERS_p_value, 5)
print("p-Value = ", SLEEP_DISORDERS_p_value)
p-Value =  0.40821

Smoking or history of smoking

In [112]:
chi2, SMOKING_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.SMOKING, Data.pdc_flag],
                               aggfunc=len).unstack())
SMOKING_p_value = round(SMOKING_p_value, 5)
print("p-Value = ", SMOKING_p_value)
p-Value =  0.13449

Thyroid disease

In [113]:
chi2, THYROID_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.THYROID, Data.pdc_flag],
                               aggfunc=len).unstack())
THYROID_p_value = round(THYROID_p_value, 5)
print("p-Value = ", THYROID_p_value)
p-Value =  0.10619

Solid Tumor

In [114]:
chi2, Solid_Tumor_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Solid_Tumor, Data.pdc_flag],
                               aggfunc=len).unstack())
Solid_Tumor_p_value = round(Solid_Tumor_p_value, 5)
print("p-Value = ", Solid_Tumor_p_value)
p-Value =  0.48818

Metastatic disease

In [115]:
chi2, Metastatic_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Metastatic, Data.pdc_flag],
                               aggfunc=len).unstack())
Metastatic_p_value = round(Metastatic_p_value, 5)
print("p-Value = ", Metastatic_p_value)
p-Value =  0.919

Leukemia/Lymphoma

In [116]:
chi2, Leukemia_Lymphoma_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Leukemia_Lymphoma, Data.pdc_flag],
                               aggfunc=len).unstack())
Leukemia_Lymphoma_p_value = round(Leukemia_Lymphoma_p_value, 5)
print("p-Value = ", Leukemia_Lymphoma_p_value)
p-Value =  0.919

Other Cancer (excludes in-situ)

In [117]:
chi2, Other_Cancer_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Other_Cancer, Data.pdc_flag],
                               aggfunc=len).unstack())
Other_Cancer_p_value = round(Other_Cancer_p_value, 5)
print("p-Value = ", Other_Cancer_p_value)
p-Value =  0.48507

Cancer In-Situ

In [118]:
chi2, Cancer_In_Situ_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Cancer_In_Situ, Data.pdc_flag],
                               aggfunc=len).unstack())
Cancer_In_Situ_p_value = round(Cancer_In_Situ_p_value, 5)
print("p-Value = ", Cancer_In_Situ_p_value)
p-Value =  0.919

Charlson Comorbidity Index (CCI) Score, pre-index

In [119]:
chi2, pre_CCI_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.pre_CCI, Data.pdc_flag],
                               aggfunc=len).unstack())
pre_CCI_p_value = round(pre_CCI_p_value, 5)
print("p-Value = ", pre_CCI_p_value)
p-Value =  nan

Inpatient costs per person

In [158]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.pre_ip_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Inpatient Costs',  size=12, weight='bold')

sns.distplot(Data.pre_ip_cost, bins=20)
sns.rugplot(Data.pre_ip_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Inpatient Costs',  size=12, weight='bold')
plt.show();
In [176]:
# T-Test:
tstat, pre_ip_cost_p_value = ttest_ind(PDC_0.pre_ip_cost, PDC_1.pre_ip_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, pre_ip_cost_p_valueZ = ranksums(PDC_0.pre_ip_cost, PDC_1.pre_ip_cost)

print("p-Value, t-Test =", pre_ip_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_ip_cost_p_valueZ)
p-Value, t-Test = 0.250306497137
p-Value, Wilcoxon Rank-Sum Test = 0.440394881705

Emergency costs per person

In [159]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.pre_er_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Emergency Costs',  size=12, weight='bold')

sns.distplot(Data.pre_er_cost, bins=20)
sns.rugplot(Data.pre_er_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Emergency Costs',  size=12, weight='bold')
plt.show();
In [177]:
# T-Test:
tstat, pre_er_cost_p_value = ttest_ind(PDC_0.pre_er_cost, PDC_1.pre_er_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, pre_er_cost_p_valueZ = ranksums(PDC_0.pre_er_cost, PDC_1.pre_er_cost)

print("p-Value, t-Test =", pre_er_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_er_cost_p_valueZ)
p-Value, t-Test = 0.228329017445
p-Value, Wilcoxon Rank-Sum Test = 0.00772250277091

Pharmacy costs per person

In [178]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.pre_rx_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Pharmacy Costs',  size=12, weight='bold')

sns.distplot(Data.pre_rx_cost, bins=20)
sns.rugplot(Data.pre_rx_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Pharmacy Costs',  size=12, weight='bold')
plt.show();
In [179]:
# T-Test:
tstat, pre_rx_cost_p_value = ttest_ind(PDC_0.pre_rx_cost, PDC_1.pre_rx_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, pre_rx_cost_p_valueZ = ranksums(PDC_0.pre_rx_cost, PDC_1.pre_rx_cost)

print("p-Value, t-Test =", pre_rx_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_rx_cost_p_valueZ)
p-Value, t-Test = 0.00937986468347
p-Value, Wilcoxon Rank-Sum Test = 1.23080937571e-08

Outpatient costs per person

In [161]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.pre_op_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Outpatient Costs',  size=12, weight='bold')

sns.distplot(Data.pre_op_cost, bins=20)
sns.rugplot(Data.pre_op_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Outpatient Costs',  size=12, weight='bold')
plt.show();
In [180]:
# T-Test:
tstat, pre_op_cost_p_value = ttest_ind(PDC_0.pre_op_cost, PDC_1.pre_op_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, pre_op_cost_p_valueZ = ranksums(PDC_0.pre_op_cost, PDC_1.pre_op_cost)

print("p-Value, t-Test =", pre_op_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_op_cost_p_valueZ)
p-Value, t-Test = 0.108600736428
p-Value, Wilcoxon Rank-Sum Test = 0.216433449663

Total healthcare costs per person

In [162]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.pre_total_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Total Healthcare Costs',  size=12, weight='bold')

sns.distplot(Data.pre_total_cost, bins=20)
sns.rugplot(Data.pre_total_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Total Healthcare Costs',  size=12, weight='bold')
plt.show();
In [181]:
# T-Test:
tstat, pre_total_cost_p_value = ttest_ind(PDC_0.pre_total_cost, PDC_1.pre_total_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, pre_total_cost_p_valueZ = ranksums(PDC_0.pre_total_cost, PDC_1.pre_total_cost)

print("p-Value, t-Test =", pre_total_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_total_cost_p_valueZ)
p-Value, t-Test = 0.36963432662
p-Value, Wilcoxon Rank-Sum Test = 0.389942846024

Total medical costs per person

In [164]:
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.pre_medical_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Total Medical Costs',  size=12, weight='bold')

sns.distplot(Data.pre_medical_cost, bins=20)
sns.rugplot(Data.pre_medical_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Total Medical Costs',  size=12, weight='bold')
plt.show();
In [182]:
# T-Test:
tstat, pre_medical_cost_p_value = ttest_ind(PDC_0.pre_medical_cost, PDC_1.pre_medical_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, pre_medical_cost_p_valueZ = ranksums(PDC_0.pre_medical_cost, PDC_1.pre_medical_cost)

print("p-Value, t-Test =", pre_medical_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", pre_medical_cost_p_valueZ)
p-Value, t-Test = 0.0763672970382
p-Value, Wilcoxon Rank-Sum Test = 0.0233895088677

Inpatient hospital stays

In [120]:
chi2, num_ip_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.num_ip, Data.pdc_flag],
                               aggfunc=len).unstack())
num_ip_p_value = round(num_ip_p_value, 5)
print("p-Value = ", num_ip_p_value)
p-Value =  0.29358

Total Inpatient days

In [121]:
chi2, total_los_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.total_los, Data.pdc_flag],
                               aggfunc=len).unstack())
total_los_p_value = round(total_los_p_value, 5)
print("p-Value = ", total_los_p_value)
p-Value =  nan

Total Outpatient visits

In [122]:
chi2, num_op_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.num_op, Data.pdc_flag],
                               aggfunc=len).unstack())
num_op_p_value = round(num_op_p_value, 5)
print("p-Value = ", num_op_p_value)
p-Value =  nan

Total Emergency visits

In [123]:
chi2, num_er_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.num_er, Data.pdc_flag],
                               aggfunc=len).unstack())
num_er_p_value = round(num_er_p_value, 5)
print("p-Value = ", num_er_p_value)
p-Value =  nan

Total Number of Prescriptions

In [124]:
chi2, num_ndc_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.num_ndc, Data.pdc_flag],
                               aggfunc=len).unstack())
num_ndc_p_value = round(num_ndc_p_value, 5)
print("p-Value = ", num_ndc_p_value)
p-Value =  nan

Number of Unique GPI6s for Prescriptions Filled

In [125]:
chi2, num_gpi6_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.num_gpi6, Data.pdc_flag],
                               aggfunc=len).unstack())
num_gpi6_p_value = round(num_gpi6_p_value, 5)
print("p-Value = ", num_gpi6_p_value)
p-Value =  nan

Generic dispensing rate

In [165]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.generic_rate, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Generic Rate',  size=12, weight='bold')

sns.distplot(Data.generic_rate, bins=20)
sns.rugplot(Data.generic_rate, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Generic Rate',  size=12, weight='bold')
plt.show();
In [183]:
# T-Test:
tstat, generic_rate_p_value = ttest_ind(PDC_0.generic_rate, PDC_1.generic_rate, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, generic_rate_p_valueZ = ranksums(PDC_0.generic_rate, PDC_1.generic_rate)

print("p-Value, t-Test =", generic_rate_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", generic_rate_p_valueZ)
p-Value, t-Test = 0.336622111445
p-Value, Wilcoxon Rank-Sum Test = 0.0914492808683

30 day adjusted fill rate

In [166]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.adjust_total_30d, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Adjusted Fill Rate',  size=12, weight='bold')

sns.distplot(Data.adjust_total_30d, bins=20)
sns.rugplot(Data.adjust_total_30d, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Adjusted Fill Rate',  size=12, weight='bold')
plt.show();
In [184]:
# T-Test:
tstat, adjust_total_30d_p_value = ttest_ind(PDC_0.adjust_total_30d, PDC_1.adjust_total_30d, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, adjust_total_30d_p_valueZ = ranksums(PDC_0.adjust_total_30d, PDC_1.adjust_total_30d)

print("p-Value, t-Test =", adjust_total_30d_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", adjust_total_30d_p_valueZ)
p-Value, t-Test = 3.8226952241e-17
p-Value, Wilcoxon Rank-Sum Test = 9.8410528174e-23

Patients with ≥ 1 hospital admission

In [126]:
chi2, pre_ip_flag_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.pre_ip_flag, Data.pdc_flag],
                               aggfunc=len).unstack())
pre_ip_flag_p_value = round(pre_ip_flag_p_value, 5)
print("p-Value = ", pre_ip_flag_p_value)
p-Value =  0.09766

Patients with ≥ 1 ER visit

In [127]:
chi2, pre_er_flag_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.pre_er_flag, Data.pdc_flag],
                               aggfunc=len).unstack())
pre_er_flag_p_value = round(pre_er_flag_p_value, 5)
print("p-Value = ", pre_er_flag_p_value)
p-Value =  1e-05

Log inpatient costs

In [167]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.log_pre_ip_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Inpatient Costs',  size=12, weight='bold')

sns.distplot(Data.log_pre_ip_cost, bins=20)
sns.rugplot(Data.log_pre_ip_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Log Inpatient Costs',  size=12, weight='bold')
plt.show();
In [185]:
# T-Test:
tstat, log_pre_ip_cost_p_value = ttest_ind(PDC_0.log_pre_ip_cost, PDC_1.log_pre_ip_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, log_pre_ip_cost_p_valueZ = ranksums(PDC_0.log_pre_ip_cost, PDC_1.log_pre_ip_cost)

print("p-Value, t-Test =", log_pre_ip_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_ip_cost_p_valueZ)
p-Value, t-Test = 0.0819561738289
p-Value, Wilcoxon Rank-Sum Test = 0.440394881705

Log ER costs

In [168]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.log_pre_er_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log ER Costs',  size=12, weight='bold')

sns.distplot(Data.log_pre_er_cost, bins=20)
sns.rugplot(Data.log_pre_er_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Log ER Costs',  size=12, weight='bold')
plt.show();
In [186]:
# T-Test:
tstat, log_pre_er_cost_p_value = ttest_ind(PDC_0.log_pre_er_cost, PDC_1.log_pre_er_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, log_pre_er_cost_p_valueZ = ranksums(PDC_0.log_pre_er_cost, PDC_1.log_pre_er_cost)

print("p-Value, t-Test =", log_pre_er_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_er_cost_p_valueZ)
p-Value, t-Test = 4.08911613057e-05
p-Value, Wilcoxon Rank-Sum Test = 0.00772250277091

Log outpatient costs

In [169]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.log_pre_op_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Outpatient Costs',  size=12, weight='bold')

sns.distplot(Data.log_pre_op_cost, bins=20)
sns.rugplot(Data.log_pre_op_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Log Outpatient Costs',  size=12, weight='bold')
plt.show();
In [187]:
# T-Test:
tstat, log_pre_op_cost_p_value = ttest_ind(PDC_0.log_pre_op_cost, PDC_1.log_pre_op_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, log_pre_op_cost_p_valueZ = ranksums(PDC_0.log_pre_op_cost, PDC_1.log_pre_op_cost)

print("p-Value, t-Test =", log_pre_op_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_op_cost_p_valueZ)
p-Value, t-Test = 0.662943248175
p-Value, Wilcoxon Rank-Sum Test = 0.216433449663

Log pharmacy costs

In [170]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.log_pre_rx_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Log Pharmacy Costs',  size=12, weight='bold')

sns.distplot(Data.log_pre_rx_cost, bins=20)
sns.rugplot(Data.log_pre_rx_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Log Pharmacy Costs',  size=12, weight='bold')
plt.show();
In [188]:
# T-Test:
tstat, log_pre_rx_cost_p_value = ttest_ind(PDC_0.log_pre_rx_cost, PDC_1.log_pre_rx_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, log_pre_rx_cost_p_valueZ = ranksums(PDC_0.log_pre_rx_cost, PDC_1.log_pre_rx_cost)

print("p-Value, t-Test =", log_pre_rx_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", log_pre_rx_cost_p_valueZ)
p-Value, t-Test = 1.20616954611e-08
p-Value, Wilcoxon Rank-Sum Test = 1.23080937571e-08

Pre-index total costs decile

In [128]:
chi2, pre_total_cat_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.pre_total_cat, Data.pdc_flag],
                               aggfunc=len).unstack())
pre_total_cat_p_value = round(pre_total_cat_p_value, 5)
print("p-Value = ", pre_total_cat_p_value)
p-Value =  0.28022

Generic scripts filled

In [129]:
chi2, numofgen_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.numofgen, Data.pdc_flag],
                               aggfunc=len).unstack())
numofgen_p_value = round(numofgen_p_value, 5)
print("p-Value = ", numofgen_p_value)
p-Value =  nan

Brand scripts filled

In [130]:
chi2, numofbrand_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.numofbrand, Data.pdc_flag],
                               aggfunc=len).unstack())
numofbrand_p_value = round(numofbrand_p_value, 5)
print("p-Value = ", numofbrand_p_value)
p-Value =  nan

Cost of generic scripts filled

In [171]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.generic_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Generic Scripts Costs',  size=12, weight='bold')

sns.distplot(Data.generic_cost, bins=20)
sns.rugplot(Data.generic_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Generic Scripts Costs',  size=12, weight='bold')
plt.show();
In [189]:
# T-Test:
tstat, generic_cost_p_value = ttest_ind(PDC_0.generic_cost, PDC_1.generic_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, generic_cost_p_valueZ = ranksums(PDC_0.generic_cost, PDC_1.generic_cost)

print("p-Value, t-Test =", generic_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", generic_cost_p_valueZ)
p-Value, t-Test = 0.000279106162701
p-Value, Wilcoxon Rank-Sum Test = 1.23307339665e-11

Cost of brand scripts filled

In [172]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.brand_cost, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Brand Scripts Costs',  size=12, weight='bold')

sns.distplot(Data.brand_cost, bins=20)
sns.rugplot(Data.brand_cost, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Brand Scripts Costs',  size=12, weight='bold')
plt.show();
In [190]:
# T-Test:
tstat, brand_cost_p_value = ttest_ind(PDC_0.brand_cost, PDC_1.brand_cost, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, brand_cost_p_valueZ = ranksums(PDC_0.brand_cost, PDC_1.brand_cost)

print("p-Value, t-Test =", brand_cost_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", brand_cost_p_valueZ)
p-Value, t-Test = 0.029669029244
p-Value, Wilcoxon Rank-Sum Test = 0.000198724217999

Ratio of generic to total pharmacy costs

In [173]:
mpl.rcParams['figure.figsize'] = (14, 6)
fig, ax = plt.subplots(1, 2)

sm.qqplot(Data.ratio_G_total_cost_post, line='q', ax=ax[0])
ax[0].set_title('Q-Q Plot of Generic/Pharmacy Ratio Costs',  size=12, weight='bold')

sns.distplot(Data.ratio_G_total_cost_post, bins=20)
sns.rugplot(Data.ratio_G_total_cost_post, vertical=False, alpha= 0.04, color='red')

ax[1].set_title('Distribution Plot of Generic/Pharmacy Ratio Costs',  size=12, weight='bold')
plt.show();
In [191]:
# T-Test:
tstat, ratio_G_total_cost_post_p_value = ttest_ind(PDC_0.ratio_G_total_cost_post,
                                                   PDC_1.ratio_G_total_cost_post, equal_var=True)

# Wilcoxon Rank-Sum Test:
zstat, ratio_G_total_cost_post_p_valueZ = ranksums(PDC_0.ratio_G_total_cost_post, PDC_1.ratio_G_total_cost_post)

print("p-Value, t-Test =", ratio_G_total_cost_post_p_value)
print("p-Value, Wilcoxon Rank-Sum Test =", ratio_G_total_cost_post_p_valueZ)
p-Value, t-Test = 0.401820328289
p-Value, Wilcoxon Rank-Sum Test = 0.757462178762

Age 18-44

In [131]:
chi2, Age_18_44_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Age_18_44, Data.pdc_flag],
                               aggfunc=len).unstack())
Age_18_44_p_value = round(Age_18_44_p_value, 5)
print("p-Value = ", Age_18_44_p_value)
p-Value =  0.0

Age 45-64

In [132]:
chi2, Age_45_64_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Age_45_64, Data.pdc_flag],
                               aggfunc=len).unstack())
Age_45_64_p_value = round(Age_45_64_p_value, 5)
print("p-Value = ", Age_45_64_p_value)
p-Value =  0.0

Age 65+

In [133]:
chi2, Age_65_Up_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Age_65_Up, Data.pdc_flag],
                               aggfunc=len).unstack())
Age_65_Up_p_value = round(Age_65_Up_p_value, 5)
print("p-Value = ", Age_65_Up_p_value)
p-Value =  3e-05

Male

In [135]:
chi2, Male_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Male, Data.pdc_flag],
                               aggfunc=len).unstack())
Male_p_value = round(Male_p_value, 5)
print("p-Value = ", Male_p_value)
p-Value =  8e-05

Female

In [136]:
chi2, Female_p_value, dof, expected = chi2_contingency(LogExtract.pivot_table(['pdc_flag'], 
                               index=[LogExtract.Female, LogExtract.pdc_flag],
                               aggfunc=len).unstack())
Female_p_value = round(Female_p_value, 5)
print("p-Value = ", Female_p_value)
p-Value =  8e-05

Northeast

In [137]:
chi2, Reg_NE_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Reg_NE, Data.pdc_flag],
                               aggfunc=len).unstack())
Reg_NE_p_value = round(Reg_NE_p_value, 5)
print("p-Value = ", Reg_NE_p_value)
p-Value =  0.0108

Midwest

In [138]:
chi2, Reg_MW_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Reg_MW, Data.pdc_flag],
                               aggfunc=len).unstack())
Reg_MW_p_value = round(Reg_MW_p_value, 5)
print("p-Value = ", Reg_MW_p_value)
p-Value =  0.08143

South

In [139]:
chi2, Reg_S_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Reg_S, Data.pdc_flag],
                               aggfunc=len).unstack())
Reg_S_p_value = round(Reg_S_p_value, 5)
print("p-Value = ", Reg_S_p_value)
p-Value =  0.00101

West

In [141]:
chi2, Reg_W_p_value, dof, expected = chi2_contingency(Data.pivot_table(['pdc_flag'], 
                               index=[Data.Reg_W, Data.pdc_flag],
                               aggfunc=len).unstack())
Reg_W_p_value = round(Reg_W_p_value, 5)
print("p-Value = ", Reg_W_p_value)
p-Value =  0.64507

Dropping our Features

We will build several dataframes to hold our features, each getting progressively smaller as the accepted level of significance decreases

In [142]:
Feat_Lvl_70 = Data

Feat_Lvl_70 = Data.drop('DEMENTIA', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('DIABETES', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('SCHIZOPHRENIA', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('pre_CCI', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('num_gpi6', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('total_los', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('num_op', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('numofgen', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('num_er', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('num_ndc', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('numofbrand', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('idx_copay', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('ALCOHOL_DRUG', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('ASTHMA', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('CEREBROVASCULAR', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('CHF', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('HIV_AIDS', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('LIVER_GALLBLADDER_PANCREAS', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('OSTEOARTHRITIS', axis=1)
Feat_Lvl_70 = Feat_Lvl_70.drop('PARALYSIS', axis=1)
In [143]:
Feat_Lvl_50 = Feat_Lvl_70

Feat_Lvl_50 = Feat_Lvl_70.drop('idx_prodtypeN', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('EPILEPSY_SEIZURE', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('MI_CAD', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('PEPTIC_ULCER', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('RENAL_FAILURE', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('RHEUMATOLOGIC', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('log_pre_op_cost', axis=1)
Feat_Lvl_50 = Feat_Lvl_50.drop('Reg_W', axis=1)
In [144]:
Feat_Lvl_30 = Feat_Lvl_50

Feat_Lvl_30 = Feat_Lvl_50.drop('CHRONIC_KIDNEY', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('CHRONIC_PAIN_FIBRO', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('COPD', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('HEPATITIS', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('SLEEP_DISORDERS', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('Solid_Tumor', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('Other_Cancer', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('pre_ip_cost', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('pre_total_cost', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('generic_rate', axis=1)
Feat_Lvl_30 = Feat_Lvl_30.drop('log_pre_ip_cost', axis=1)
In [145]:
Feat_Lvl_20 = Feat_Lvl_30

Feat_Lvl_20 = Feat_Lvl_30.drop('CARDIAC_ARRYTHMIA', axis=1)
Feat_Lvl_20 = Feat_Lvl_20.drop('CARDIAC_VALVULAR', axis=1)
Feat_Lvl_20 = Feat_Lvl_20.drop('DEPRESSION', axis=1)
Feat_Lvl_20 = Feat_Lvl_20.drop('num_ip', axis=1)
Feat_Lvl_20 = Feat_Lvl_20.drop('pre_total_cat', axis=1)
In [202]:
Feat_Lvl_10 = Feat_Lvl_20

#Feat_Lvl_10 = Feat_Lvl_20.drop('idx_paytypN', axis=1)
Feat_Lvl_10 = Feat_Lvl_10.drop('PERIPHERAL_VASCULAR', axis=1)
Feat_Lvl_10 = Feat_Lvl_10.drop('SMOKING', axis=1)
Feat_Lvl_10 = Feat_Lvl_10.drop('THYROID', axis=1)
Feat_Lvl_10 = Feat_Lvl_10.drop('pre_er_cost', axis=1)
Feat_Lvl_10 = Feat_Lvl_10.drop('pre_op_cost', axis=1)
In [147]:
Feat_Lvl_05 = Feat_Lvl_10
Feat_Lvl_05 = Feat_Lvl_05.drop('log_idx_copay', axis=1)
Feat_Lvl_05 = Feat_Lvl_05.drop('pre_ip_flag', axis=1)
Feat_Lvl_05 = Feat_Lvl_05.drop('Reg_MW', axis=1)
In [207]:
print(Feat_Lvl_05.shape[1])
Feat_Lvl_05.head(3)
25
Out[207]:
pdc_flag age_cat DYSLIPIDEMIA HYPERTENSION Metastatic Leukemia_Lymphoma Cancer_In_Situ pre_rx_cost pre_medical_cost adjust_total_30d pre_er_flag log_pre_er_cost log_pre_rx_cost generic_cost brand_cost ratio_G_total_cost numofgen_post numofbrand_post Age_18_44 Age_45_64 Age_65_Up Male Female Reg_NE Reg_S
1 0 2 0 0 0 0 0 0.585524 -0.238604 -0.146839 0 -0.446072 1.148968 -0.527138 0.726295 -1.356582 2 13 1 0 0 0 1 0 1
4 1 2 0 0 0 0 0 -0.212157 -0.360230 0.103884 0 -0.446072 0.390026 1.058492 -0.373082 1.083909 14 2 1 0 0 1 0 0 1
5 0 2 0 1 0 0 0 -0.234585 -0.303226 -0.594129 0 -0.446072 0.340297 0.314504 -0.291908 0.203710 6 0 1 0 0 1 0 0 1

In our R code, we got an extremely high p-value (0.581) for id_paytypN from the kruskal-williams test, so we removed it from our Feat_Lvl_10 data frame.

Feature Selection with Random Forests

In [165]:
from sklearn.ensemble import RandomForestRegressor

X = Data.iloc[:, 1:].values
y = Data.iloc[:, 0].values
Names = Data.columns[1:]
rf = RandomForestRegressor()
rf.fit(X, y)
print('Features sorted by their score:')
print(sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), Names), reverse=True))
Features sorted by their score:
[(0.15179999999999999, 'adjust_total_30d'), (0.077799999999999994, 'numofgen_post'), (0.071300000000000002, 'generic_cost'), (0.045699999999999998, 'pre_total_cost'), (0.0395, 'idx_copay'), (0.037999999999999999, 'num_ndc'), (0.036200000000000003, 'pre_medical_cost'), (0.035400000000000001, 'log_idx_copay'), (0.034700000000000002, 'num_gpi6'), (0.033799999999999997, 'num_op'), (0.032399999999999998, 'numofbrand_post'), (0.0309, 'numofgen'), (0.030800000000000001, 'pre_rx_cost'), (0.0263, 'pre_op_cost'), (0.0258, 'generic_rate'), (0.0252, 'ratio_G_total_cost'), (0.023800000000000002, 'log_pre_op_cost'), (0.021600000000000001, 'log_pre_rx_cost'), (0.0212, 'age_cat'), (0.0129, 'brand_cost'), (0.011900000000000001, 'pre_CCI'), (0.0109, 'numofbrand'), (0.0097000000000000003, 'num_er'), (0.0094999999999999998, 'Reg_MW'), (0.0089999999999999993, 'pre_er_cost'), (0.0088999999999999999, 'idx_paytypN'), (0.0085000000000000006, 'DYSLIPIDEMIA'), (0.0074000000000000003, 'Reg_S'), (0.0066, 'SLEEP_DISORDERS'), (0.0061999999999999998, 'idx_prodtypeN'), (0.0054000000000000003, 'log_pre_er_cost'), (0.0045999999999999999, 'total_los'), (0.0045999999999999999, 'Male'), (0.0045999999999999999, 'Age_45_64'), (0.0044999999999999997, 'HYPERTENSION'), (0.0043, 'Female'), (0.0043, 'DIABETES'), (0.0041999999999999997, 'Age_65_Up'), (0.0040000000000000001, 'Reg_W'), (0.0040000000000000001, 'OSTEOARTHRITIS'), (0.0035999999999999999, 'Age_18_44'), (0.0033999999999999998, 'THYROID'), (0.0030000000000000001, 'pre_total_cat'), (0.0030000000000000001, 'DEPRESSION'), (0.0030000000000000001, 'CHRONIC_PAIN_FIBRO'), (0.0027000000000000001, 'pre_er_flag'), (0.0025999999999999999, 'PERIPHERAL_VASCULAR'), (0.0022000000000000001, 'SMOKING'), (0.0022000000000000001, 'ASTHMA'), (0.0020999999999999999, 'pre_ip_cost'), (0.0020999999999999999, 'CHRONIC_KIDNEY'), (0.002, 'log_pre_ip_cost'), (0.002, 'MI_CAD'), (0.0019, 'Reg_NE'), (0.0016000000000000001, 'Solid_Tumor'), (0.0016000000000000001, 'CARDIAC_VALVULAR'), (0.0016000000000000001, 'CARDIAC_ARRYTHMIA'), (0.0015, 'LIVER_GALLBLADDER_PANCREAS'), (0.0014, 'RENAL_FAILURE'), (0.0011000000000000001, 'Other_Cancer'), (0.001, 'PARALYSIS'), (0.00089999999999999998, 'HEPATITIS'), (0.00089999999999999998, 'COPD'), (0.00089999999999999998, 'ALCOHOL_DRUG'), (0.00080000000000000004, 'RHEUMATOLOGIC'), (0.00080000000000000004, 'PEPTIC_ULCER'), (0.00059999999999999995, 'CEREBROVASCULAR'), (0.00050000000000000001, 'DEMENTIA'), (0.00040000000000000002, 'pre_ip_flag'), (0.00020000000000000001, 'EPILEPSY_SEIZURE'), (0.0, 'num_ip'), (0.0, 'SCHIZOPHRENIA'), (0.0, 'Metastatic'), (0.0, 'Leukemia_Lymphoma'), (0.0, 'HIV_AIDS'), (0.0, 'Cancer_In_Situ'), (0.0, 'CHF')]
In [166]:
RF_Extract = Data[['adjust_total_30d', 'generic_cost', 'pre_total_cost', 'idx_copay', 'num_gpi6',
                   'num_ndc', 'pre_medical_cost', 'num_op', 'numofgen', 'pre_rx_cost', 'pre_op_cost', 'generic_rate',
                   'ratio_G_total_cost', 'age_cat', 'brand_cost', 'pre_CCI']]
In [203]:
Estimate = Data[['adjust_total_30d', 'generic_cost', 'pre_total_cost', 'log_idx_copay', 'num_gpi6',
                 'num_ndc', 'pre_medical_cost', 'num_op', 'numofgen', 'pre_rx_cost', 'pre_op_cost', 'generic_rate',
                 'ratio_G_total_cost', 'age_cat', 'brand_cost', 'pre_CCI', 'age_cat', 'DYSLIPIDEMIA', 'HYPERTENSION',
                 'Metastatic', 'Leukemia_Lymphoma', 'Cancer_In_Situ', 'pre_er_flag', 'log_pre_er_cost', 'Age_18_44',
                 'Age_45_64', 'Age_65_Up', 'Male', 'Female', 'Reg_NE', 'Reg_S', 'numofbrand']]

Logistic Regression Models

In [246]:
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler

#Convert dataframe values into correstponding X matrix and y vector
X, y = Estimate.iloc[:, 1:].values, Data.iloc[:, 0].values

#Convert X and y arrays into train and test arrays
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#Standardize the Matrix values to prepare for feature processing
#sc = StandardScaler()
#X_std = sc.fit_transform(X)
#X_train_std = sc.fit_transform(X_train)
#X_test_std = sc.fit_transform(X_test)
#X_std
In [249]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train, y_train)
Est_Accuracy = lr.score(X_test, y_test)
print('Accuracy:', lr.score(X_test, y_test))
Accuracy: 0.639393939394
In [251]:
X2, y2 = Data.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train2, y_train2)
Min_Accuracy = lr.score(X_test2, y_test2)
print('Accuracy:', lr.score(X_test2, y_test2))
Accuracy: 0.660606060606
In [227]:
print(lr.fit(X_train, y_train))
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
In [228]:
from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1: %.3f' % f1_score(y_true=y_test, y_pred=y_pred))
Precision: 0.653
Recall: 0.329
F1: 0.437
In [253]:
X3, y3 = Feat_Lvl_70.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train3, y_train3)
Below_70_Accuracy = lr.score(X_test3, y_test3)
print('Accuracy:', lr.score(X_test3, y_test3))
Accuracy: 0.651515151515
In [257]:
X4, y4 = Feat_Lvl_50.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train4, X_test4, y_train4, y_test4 = train_test_split(X4, y4, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train4, y_train4)
Below_50_Accuracy = lr.score(X_test4, y_test4)
print('Accuracy:', lr.score(X_test4, y_test4))
Accuracy: 0.639393939394
In [255]:
X5, y5 = Feat_Lvl_30.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train5, X_test5, y_train5, y_test5 = train_test_split(X5, y5, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train5, y_train5)
Below_30_Accuracy = lr.score(X_test5, y_test5)
print('Accuracy:', lr.score(X_test5, y_test5))
Accuracy: 0.621212121212
In [256]:
X6, y6 = Feat_Lvl_20.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train6, X_test6, y_train6, y_test6 = train_test_split(X6, y6, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train6, y_train6)
Below_20_Accuracy = lr.score(X_test6, y_test6)
print('Accuracy:', lr.score(X_test6, y_test6))
Accuracy: 0.639393939394
In [258]:
X7, y7 = Feat_Lvl_10.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train7, X_test7, y_train7, y_test7 = train_test_split(X7, y7, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train7, y_train7)
Below_10_Accuracy = lr.score(X_test7, y_test7)
print('Accuracy:', lr.score(X_test7, y_test7))
Accuracy: 0.624242424242
In [259]:
X8, y8 = Feat_Lvl_05.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train8, X_test8, y_train8, y_test8 = train_test_split(X8, y8, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train8, y_train8)
Below_05_Accuracy = lr.score(X_test8, y_test8)
print('Accuracy:', lr.score(X_test8, y_test8))
Accuracy: 0.636363636364
In [260]:
X9, y9 = RF_Extract.iloc[:, 1:].values, Data.iloc[:, 0].values
X_train9, X_test9, y_train9, y_test9 = train_test_split(X9, y9, test_size=0.2, random_state=0)

lr = LogisticRegression(penalty='l2', C=1)
lr.fit(X_train9, y_train9)
RF_Accuracy = lr.score(X_test9, y_test9)
print('Accuracy:', lr.score(X_test9, y_test9))
Accuracy: 0.612121212121
In [262]:
arrays = [['Logistic Model'],
          ['Accuracy Score']]
tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=['Model Type','Data Frame Type'])
In [265]:
Output = pd.DataFrame(0, index=['minimal feature reduction','p-value below 0.7', 'p-value below 0.5', 'p-value below 0.3',
                                'p-value below 0.2', 'p-value below 0.1', 'p-value below 0.05', 'Random Forest extract',
                                'combined features'],
                      columns=index)
Output
Out[265]:
Model Type Logistic Model
Data Frame Type Accuracy Score
minimal feature reduction 0
p-value below 0.7 0
p-value below 0.5 0
p-value below 0.3 0
p-value below 0.2 0
p-value below 0.1 0
p-value below 0.05 0
Random Forest extract 0
combined features 0
In [266]:
Output.loc[('minimal feature reduction'), ('Logistic Model', 'Accuracy Score')] = Min_Accuracy
Output.loc[('p-value below 0.7'), ('Logistic Model', 'Accuracy Score')] = Below_70_Accuracy
Output.loc[('p-value below 0.5'), ('Logistic Model', 'Accuracy Score')] = Below_50_Accuracy
Output.loc[('p-value below 0.3'), ('Logistic Model', 'Accuracy Score')] = Below_30_Accuracy
Output.loc[('p-value below 0.2'), ('Logistic Model', 'Accuracy Score')] = Below_20_Accuracy
Output.loc[('p-value below 0.1'), ('Logistic Model', 'Accuracy Score')] = Below_10_Accuracy
Output.loc[('p-value below 0.05'), ('Logistic Model', 'Accuracy Score')] = Below_05_Accuracy
Output.loc[('Random Forest extract'), ('Logistic Model', 'Accuracy Score')] = RF_Accuracy
Output.loc[('combined features'), ('Logistic Model', 'Accuracy Score')] = Est_Accuracy
Output
Out[266]:
Model Type Logistic Model
Data Frame Type Accuracy Score
minimal feature reduction 0.660606
p-value below 0.7 0.651515
p-value below 0.5 0.639394
p-value below 0.3 0.621212
p-value below 0.2 0.639394
p-value below 0.1 0.624242
p-value below 0.05 0.636364
Random Forest extract 0.612121
combined features 0.639394
In [268]:
Output.style.set_table_styles([{'selector': 'tr:hover', 'props': [('background-color', 'lavender')]},
                            {'selector': 'tr:nth-child(even)', 'props': [('background-color', 'lightgreen')]}])
Out[268]:
Logistic Model
Accuracy Score
minimal feature reduction 0.660606
p-value below 0.7 0.651515
p-value below 0.5 0.639394
p-value below 0.3 0.621212
p-value below 0.2 0.639394
p-value below 0.1 0.624242
p-value below 0.05 0.636364
Random Forest extract 0.612121
combined features 0.639394

Confusion Matrix

In [269]:
from sklearn.metrics import confusion_matrix

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
[[159  28]
 [ 91  52]]
In [270]:
fig, ax = plt.subplots(figsize=(4, 4))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.5)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show();

Pipeline (Logistic Regression and Principal Component Analysis) Predictor

In [215]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
Test Accuracy: 0.570

Comparing the Two Predictors Using 10-Fold Cross-Validation

Logistic Model

In [235]:
from sklearn.cross_validation import cross_val_score

scores = cross_val_score(estimator=lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV Accuracy Scores: %s' % scores)
print('\n')
print('CV Mean Accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
CV Accuracy Scores: [ 0.66917293  0.65909091  0.60606061  0.68939394  0.63636364  0.6969697
  0.59848485  0.68939394  0.56060606  0.64885496]


CV Mean Accuracy: 0.645 +/- 0.043

Pipeline Model

In [213]:
scores = cross_val_score(estimator=pipe_lr, X=X_train, y=y_train, cv=10, n_jobs=1)
print('CV Accuracy Scores: %s' % scores)
print('\n')
print('CV Mean Accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
CV Accuracy Scores: [ 0.61654135  0.59848485  0.62121212  0.61363636  0.59090909  0.63636364
  0.61363636  0.61363636  0.60606061  0.57251908]


CV Mean Accuracy: 0.608 +/- 0.017

Diagnosing Bias and Variance with Learning Curves

In [233]:
from sklearn.learning_curve import learning_curve

pipe_lr = Pipeline([
        ('scl', StandardScaler()),
        ('clf', LogisticRegression(penalty='l2', random_state=0))])
train_sizes, train_scores, test_scores = learning_curve(estimator = pipe_lr, X = X_train, y = y_train,
                                                        train_sizes = np.linspace(0.1, 1, 10), cv = 10, n_jobs = 1)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(train_sizes, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
plt.fill_between(train_sizes, train_mean + train_std, train_mean - train_std, alpha=0.15, color='blue')
plt.plot(train_sizes, test_mean, color='green', linestyle='--', marker='s', markersize=5, label='validation accuracy')
plt.fill_between(train_sizes, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')

plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend(loc="lower right")
plt.show()

Hyperparameter Tuning with Validation Curves

In [234]:
from sklearn.learning_curve import validation_curve
param_range = [0.001, 0.01, 0.01, 1.0, 10.0, 100.0]
train_scores, test_scores = validation_curve(estimator = pipe_lr, X = X_train, y = y_train, param_name = 'clf__C',
                                             param_range = param_range, cv = 10)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.plot(param_range, train_mean, color='blue', marker='o', markersize=5, label='training accuracy')
plt.fill_between(param_range, train_mean + train_std, train_mean - train_std, alpha = 0.15, color='blue', )
plt.plot(param_range, test_mean, linestyle='--', color='green', marker='s', markersize=5, label='validation accuracy')
plt.fill_between(param_range, test_mean + test_std, test_mean - test_std, alpha=0.15, color='green')

plt.xscale('log')
plt.legend(loc='lower right')
plt.xlabel('parameter C')
plt.ylabel('Accuracy')
plt.show()